[Feature][history server] support endpoint /events #4479
seanlaii wants to merge 5 commits into ray-project:master from
Conversation
Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>
```go
	// Extract sourceHostname and sourcePid from nested events where available
	extractHostnameAndPid(event, eventType, data)
}
```
Task profile event data may be silently lost
Medium Severity
The field `taskProfileEvents` is uniquely plural among all event data fields (others are singular, like `taskDefinitionEvent` and `actorLifecycleEvent`). This naming suggests it may contain an array of profile events rather than a single map. The code only handles `map[string]any` type assertions; if the actual structure is `[]any`, the type assertion silently fails, causing task profile data to not be captured in `customFields` and `jobId` extraction to fail (defaulting to `global` grouping).
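For illustration, here is a minimal Go sketch of the defensive handling the bot is describing, assuming a decoded-JSON event as `map[string]any`; `defensiveJobID` is a hypothetical helper, not the PR's actual code. A bare map assertion silently yields the zero value when the field is a slice, while a type switch makes the `global` fallback explicit:

```go
package main

import "fmt"

// defensiveJobID illustrates the reviewer-bot concern: handle both a single
// map and a slice of maps so a mismatched shape cannot silently drop data.
func defensiveJobID(data map[string]any) string {
	switch v := data["taskProfileEvents"].(type) {
	case map[string]any:
		// Single object, the shape the PR author argues is actually emitted.
		if jobID, ok := v["jobId"].(string); ok {
			return jobID
		}
	case []any:
		// Hypothetical plural shape the bot is worried about.
		for _, item := range v {
			if m, ok := item.(map[string]any); ok {
				if jobID, ok := m["jobId"].(string); ok {
					return jobID
				}
			}
		}
	}
	return "global" // default grouping when no jobId is found
}

func main() {
	data := map[string]any{"taskProfileEvents": map[string]any{"jobId": "BAAAAA=="}}
	fmt.Println(defensiveJobID(data)) // BAAAAA==
}
```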
The concern is not valid. The taskProfileEvents field is a single object (map), not an array. Here's the evidence:
```proto
message TaskProfileEvents {
  bytes task_id = 1;
  int32 attempt_number = 2;
  bytes job_id = 3;
  ray.rpc.ProfileEvents profile_events = 4;  // the "events" array is INSIDE here
}
```
- Actual JSON Data
```json
{
  "eventType": "TASK_PROFILE_EVENT",
  "taskProfileEvents": {
    "attemptNumber": 0,
    "jobId": "BAAAAA==",
    "taskId": "...",
    "profileEvents": {
      "componentId": "...",
      "componentType": "worker",
      "events": [...]  // <-- the array is HERE, inside profileEvents
    }
  }
}
```
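A quick, illustrative check of that shape (sample trimmed from the JSON above; this snippet is not part of the PR) shows the map assertion succeeding and a slice assertion failing:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed sample: taskProfileEvents is a single object, and the "events"
// array lives inside profileEvents.
const sample = `{
  "eventType": "TASK_PROFILE_EVENT",
  "taskProfileEvents": {
    "attemptNumber": 0,
    "jobId": "BAAAAA==",
    "profileEvents": {"componentType": "worker", "events": []}
  }
}`

func main() {
	var data map[string]any
	if err := json.Unmarshal([]byte(sample), &data); err != nil {
		panic(err)
	}
	tpe, ok := data["taskProfileEvents"].(map[string]any) // succeeds: it is a map
	fmt.Println(ok, tpe["jobId"])                         // true BAAAAA==
	_, isSlice := data["taskProfileEvents"].([]any)       // fails: not an array
	fmt.Println(isSlice)                                  // false
}
```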
Hi @CheyuWu @troychiu @justinyeh1995, please help review the PR when you have a chance. Thank you!
```go
	var response map[string]any
	// ...
	if jobID != "" {
		events := s.eventHandler.ClusterEventMap.GetByJobID(clusterSessionKey, jobID)
```
Just want to double-check the behavior here: /events?job_id= would also return all events. Is that intended?
No, it should only return the events related to the specified job.
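For context, a heavily simplified sketch of the intended branching; `eventStore`, `GetByJobID`, and `GetAll` are stand-in names, not the PR's actual handler types, and only the decision on the `job_id` query parameter is the point:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// eventStore is a hypothetical stand-in for the handler's state.
type eventStore struct {
	byJob map[string][]string // jobId (or "global") -> events
}

func (s *eventStore) GetByJobID(jobID string) []string { return s.byJob[jobID] }

func (s *eventStore) GetAll() []string {
	var all []string
	for _, evs := range s.byJob {
		all = append(all, evs...)
	}
	return all
}

func (s *eventStore) handleEvents(w http.ResponseWriter, r *http.Request) {
	jobID := r.URL.Query().Get("job_id")
	var events []string
	if jobID != "" {
		// Non-empty job_id: only the events for that job, as stated above.
		events = s.GetByJobID(jobID)
	} else {
		// No job_id: return everything.
		events = s.GetAll()
	}
	fmt.Fprintf(w, "%d events\n", len(events))
}

func main() {
	store := &eventStore{byJob: map[string][]string{"01000000": {"e1", "e2"}, "global": {"e3"}}}
	req := httptest.NewRequest(http.MethodGet, "/events?job_id=01000000", nil)
	rec := httptest.NewRecorder()
	store.handleEvents(rec, req)
	fmt.Print(rec.Body.String()) // prints "2 events"
}
```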


Why are these changes needed?
The Ray Dashboard provides an `/events` endpoint that returns cluster-level events for observability. When a Ray cluster is deleted, this data is lost. This PR implements the `/events` API for the History Server, enabling users to query historical RayEvents from deleted clusters with a response format compatible with the Ray Dashboard.

Background: Two Event Systems in Ray
Ray has two distinct event systems:

- Dashboard Cluster Events (`event.proto`) - Human-readable messages stored in `logs/events/`
- RayEvents / Export Events (`events_base_event.proto`) - Structured task/actor lifecycle data exported via the Export API

The History Server Collector stores RayEvents (Export Events), which contain rich task/actor data. This implementation transforms RayEvents into a Dashboard-compatible response format.
Summary
This PR adds:
- New event types (`Event`, `EventMap`, `ClusterEventMap`) in `types/event.go`, following the existing `Task`/`Actor` pattern
- Transformation helpers (`transformToEvent`, `extractHostnameAndPid`, `extractJobIDFromEvent`) in `eventserver.go` (see the sketch after this list)
- An `/events` handler in `router.go` supporting both live cluster proxy and historical data retrieval
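As referenced above, here is a rough sketch of the transformation step, assuming a decoded RayEvent as `map[string]any`. The `Event` struct and field set are approximations; the PR's actual `transformToEvent` and type in `types/event.go` may differ:

```go
package main

import "fmt"

// Event approximates a Dashboard-compatible event; the real type in
// types/event.go may use a different field set and JSON tags.
type Event struct {
	EventID      string         `json:"eventId"`
	EventType    string         `json:"eventType"`
	CustomFields map[string]any `json:"customFields,omitempty"`
}

// transformToEvent mirrors the role of the PR's helper of the same name, but
// its body here is a simplified stand-in: copy the common fields and carry
// the per-type nested payloads along as custom fields.
func transformToEvent(raw map[string]any) Event {
	e := Event{CustomFields: map[string]any{}}
	if id, ok := raw["eventId"].(string); ok {
		e.EventID = id
	}
	if t, ok := raw["eventType"].(string); ok {
		e.EventType = t
	}
	for k, v := range raw {
		if k != "eventId" && k != "eventType" {
			e.CustomFields[k] = v
		}
	}
	return e
}

func main() {
	raw := map[string]any{
		"eventId":             "abc",
		"eventType":           "TASK_DEFINITION_EVENT",
		"taskDefinitionEvent": map[string]any{"jobId": "01000000"},
	}
	fmt.Printf("%+v\n", transformToEvent(raw))
}
```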
Design Decisions

- Events are grouped by `jobId` or `global`: node/lifecycle events without a `jobId` are stored under the `global` key (see the sketch after this list)
- `types.EventType` constants: type-safe event type references instead of raw strings
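A minimal sketch of that grouping rule, assuming a plain map keyed by `jobId` with `global` as the fallback bucket; the PR's `EventMap`/`ClusterEventMap` types are more involved (per-cluster sessions, concurrency), so this only illustrates the bucketing decision:

```go
package main

import "fmt"

// EventMap buckets decoded events by jobId, falling back to "global" for
// events that carry no jobId (e.g. node or lifecycle events).
type EventMap map[string][]map[string]any

func (m EventMap) Add(event map[string]any) {
	key := "global"
	if jobID, ok := event["jobId"].(string); ok && jobID != "" {
		key = jobID
	}
	m[key] = append(m[key], event)
}

func main() {
	m := EventMap{}
	m.Add(map[string]any{"eventType": "NODE_DEFINITION_EVENT"})                // no jobId -> global
	m.Add(map[string]any{"eventType": "TASK_DEFINITION_EVENT", "jobId": "01"}) // job-scoped
	fmt.Println(len(m["global"]), len(m["01"])) // 1 1
}
```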
API Response Format

Fields with Partial Availability
Unlike Dashboard Cluster Events, which always populate sourceHostname and sourcePid, RayEvents only contain this information in specific nested event types (see the sketch below):

- `sourceHostname`: from `NodeDefinitionEvent.hostname`, available in `NODE_DEFINITION_EVENT` only
- `sourcePid`: from `pid` or `driverPid`, available in `TASK_LIFECYCLE_EVENT`, `ACTOR_LIFECYCLE_EVENT`, `DRIVER_JOB_DEFINITION_EVENT`

NOTE: This is a fundamental difference between the two event systems in Ray. Dashboard Cluster Events are designed for human-readable logging, while RayEvents are structured for programmatic analysis.
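As noted above, a sketch of how that partial-availability rule could be applied, with assumed camelCase nested field names (`nodeDefinitionEvent`, `taskLifecycleEvent`, etc.); the real `extractHostnameAndPid` in `eventserver.go` may differ in detail:

```go
package main

import "fmt"

// extractHostnameAndPid pulls hostname and pid out of the nested payloads
// that actually carry them: hostname only from the node definition event,
// pid/driverPid only from the lifecycle and driver definition events.
func extractHostnameAndPid(data map[string]any) (hostname string, pid float64) {
	if node, ok := data["nodeDefinitionEvent"].(map[string]any); ok {
		hostname, _ = node["hostname"].(string)
	}
	for _, key := range []string{"taskLifecycleEvent", "actorLifecycleEvent", "driverJobDefinitionEvent"} {
		if nested, ok := data[key].(map[string]any); ok {
			if p, ok := nested["pid"].(float64); ok { // JSON numbers decode to float64
				pid = p
			} else if p, ok := nested["driverPid"].(float64); ok {
				pid = p
			}
		}
	}
	return hostname, pid
}

func main() {
	data := map[string]any{"nodeDefinitionEvent": map[string]any{"hostname": "node-1"}}
	h, p := extractHostnameAndPid(data)
	fmt.Println(h, p) // node-1 0
}
```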
Known Limitation: Event deduplication is not implemented. Duplicate events may occur if the same files are re-read (e.g., hourly refresh). This will be addressed in a follow-up issue along with the existing file tracking TODO.
Related issue number
Closes #4380
Checks